Manual CNV analysis
Contents
Manual CNV analysis¶
Objective
This vignette describes how copy number variation (CNV) can
be identified using Mosaic. Data filtering, normalization and
ploidy computation are described.
import missionbio.mosaic as ms
sample = ms.load_example_dataset('3 cell mix')
Loading, <_io.BytesIO object at 0x7fa43b793f90>
Loaded in 0.4s.
This is an analyzed h5 file. Hence the clones and clusters
are already labeled. It contains three cell lines KG-1,
Tom-1, and Jurkat. It also contains doublets of each pair
as seen in the heatmap.
For your h5 file, identify the clones using the DNA variants
before proceeding with this vignette.
fig = sample.dna.heatmap('NGT')
fig
Filtering amplicons¶
We will only be working with the DNA and CNV assays.
The CNV object contains 138 amplicons. But not all
of them worked as expected. We will drop amplicons
which worked in less than half the total cells.
# This returns the reads in each cell and amplicon
reads = sample.cnv.get_attribute('read_counts', constraint='row+col')
reads
| AMLV2_AMP1 | AMLV2_AMP2 | AMLV2_AMP3 | AMLV2_AMP4 | AMLV2_AMP5 | AMLV2_AMP6 | AMLV2_AMP7 | AMLV2_AMP8 | AMLV2_AMP9 | AMLV2_AMP10 | ... | chr10_106721610 | chr10_5554293 | chr10_77210191 | chr14_56969005 | chr16_55770629 | chr16_8569820 | chr18_9750662 | chr6_17076840 | chr6_40116264 | chr6_62094287 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| read_counts | |||||||||||||||||||||
| AACAACCTAAACTTGTCG | 139 | 161 | 105 | 219 | 92 | 157 | 152 | 127 | 74 | 178 | ... | 478 | 169 | 184 | 226 | 201 | 123 | 332 | 127 | 304 | 218 |
| AACAACTGGTACGTTGGA | 82 | 64 | 66 | 87 | 9 | 105 | 74 | 53 | 43 | 94 | ... | 255 | 132 | 100 | 134 | 124 | 31 | 155 | 126 | 173 | 102 |
| AACAATGCAAGACCACGC | 110 | 130 | 54 | 131 | 24 | 47 | 70 | 28 | 52 | 99 | ... | 419 | 32 | 37 | 181 | 139 | 98 | 170 | 322 | 306 | 102 |
| AACAATGCACTGACGGAT | 124 | 125 | 93 | 163 | 25 | 148 | 164 | 93 | 65 | 177 | ... | 495 | 58 | 22 | 124 | 233 | 78 | 195 | 197 | 307 | 214 |
| AACAATGCAGAGCCATCC | 50 | 59 | 31 | 39 | 14 | 51 | 49 | 16 | 29 | 82 | ... | 122 | 18 | 71 | 63 | 51 | 42 | 151 | 134 | 114 | 118 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| TTGTATCACGCTGTTCGG | 38 | 17 | 16 | 5 | 3 | 67 | 22 | 11 | 12 | 27 | ... | 75 | 28 | 63 | 14 | 69 | 7 | 80 | 99 | 72 | 29 |
| TTGTCAACCGCTTGTCCT | 104 | 71 | 48 | 82 | 0 | 61 | 64 | 30 | 17 | 60 | ... | 187 | 25 | 127 | 86 | 96 | 48 | 74 | 134 | 153 | 104 |
| TTGTCAACCTACAACACC | 162 | 88 | 22 | 39 | 14 | 128 | 44 | 22 | 36 | 71 | ... | 379 | 53 | 132 | 122 | 235 | 125 | 294 | 153 | 345 | 82 |
| TTGTCAACCTAGTAACGG | 140 | 107 | 42 | 157 | 52 | 149 | 89 | 46 | 34 | 115 | ... | 189 | 96 | 142 | 177 | 153 | 120 | 265 | 207 | 274 | 89 |
| TTGTTAGAGATCAGGATG | 60 | 57 | 39 | 76 | 13 | 32 | 69 | 27 | 35 | 57 | ... | 267 | 10 | 87 | 55 | 104 | 82 | 95 | 111 | 151 | 96 |
2476 rows × 135 columns
# Only amplicons found in more than half the cells are analyzed
# The other amplicons are dropped.
working_amplicons = (reads.median() > 0).values
sample.cnv = sample.cnv[:, working_amplicons]
# We are now left with 135 amplicons
sample.cnv.shape
(2476, 135)
Normalization and ploidy computation¶
Finding the ploidy of the cells can be split into
two distinct calculations.
Normalization
This step is required to correct for systemic artefacts.
Cell to cell, and amplicon to amplicon variations are corrected.
Each cell is normalized to it’s total reads and each amplicon
to its median counts. This is performed using thenormalize_reads
method, which adds thenormalized_countslayer to the CNV object.Correcting the baseline
Not all amplicons show copy loss or copy gain. In the normalization
step all amplicons in a cell were normalized using the same value
i.e. the total reads in the cell. This results in slight deviations
from the true ploidy value of the cell for each amplicon.To correct for this, a clone which is known to be diploid is identified
and all the other cells are normalized to the counts of this clone. This
step is done using thecompute_ploidymethod, and providing it
with the barcodes known to be diploid. This adds theploidylayer
to the CNV object.
# Reads are normalized to correct for systemic artefacts
sample.cnv.normalize_reads()
# Assuming TOM-1 cells are diploid for all the amplicons,
# we can compute the ploidy of the other cell lines as follows
sample.cnv.compute_ploidy(diploid_cells=sample.dna.barcodes('TOM-1'))
Visualizations¶
The ploidy values are stored in the ploidy layer
These can be visualized using a heatmap or a ploidy line plot for each clone.
# Assign the DNA labels to the CNV object
# We want to ensure the labels are the same
sample.cnv.set_labels(sample.dna.get_labels())
sample.cnv.set_palette(sample.dna.get_palette())
Heatmap¶
# Heatmap with the features ordered by the chromosome position
fig = sample.cnv.heatmap('ploidy', features='positions')
fig
# Heatmap for a subset of the chromosomes
fig = sample.cnv.heatmap('ploidy', features=['4', '7', '20'])
fig
# Heatmap with the features grouped by the genes
# The first time this runs, it fetches the gene names from ensembl
# The annotation can also be fetched using using sample.cnv.get_gene_names()
fig = sample.cnv.heatmap('ploidy', features='genes')
fig
# Heatmap for a subset of the genes
fig = sample.cnv.heatmap('ploidy', features=['EZH2', 'TET2', 'TP53'])
fig
Line plot¶
As seen in the heatmap, KG-1 has a copy loss for some
of the amplicons in the panel. TOM-1 does not show much
deviation from ploidy 2.
This can also be visualized by the per amplicon ploidy plot.
This plot is generated for each clone separately.
# KG-1 shows copy loss at two different locations
fig = sample.cnv.plot_ploidy('KG-1')
fig
# KG-1 shows copy loss at two different locations
fig = sample.cnv.plot_ploidy('KG-1')
fig
# Jurkat is diploid for all the amplicons
fig = sample.cnv.plot_ploidy('Jurkat')
fig
# Since TOM-1 was used to set the baseline,
# it is perfectly diploid for all amplicons
fig = sample.cnv.plot_ploidy('TOM-1')
fig